ABSTRACT


The estimation of the parameters of a linear statistical 
model is generally accomplished by the method of least 
squares. However, when the method of least squares is 
applied to nonorthogonal problems the resulting estimates 
may be significantly different from the true parameters. 
The method of ridge regression may provide better estimates 
in these cases; however, a probability distribution of the 
ridge estimator is presently not known. The form of such a 
distribution is dependent upon how the ridge parameter, k, 
is selected. Two possible objective methods of choosing k
are examined to determine if either one leads to a useful
probability distribution.
I. BACKGROUND


The following conventions will be used throughout. 
Unless otherwise noted, capital letters and Greek letters 
will refer to matrices and vectors while lower case letters 


will refer to scalars. 


A. INTRODUCTION 
The use of linear statistical models is widespread in
scientific fields of all kinds. Generally, the linear
statistical model is postulated as

$$Y = X\beta + \varepsilon \qquad (1)$$


where Y is an n x 1 vector of n observed values of a
dependent variable, X is an n x p matrix containing n
values for each of p predictor (independent) variables,
$\beta$ is a p x 1 vector of p unknown parameters (or
coefficients) to be estimated from data, and $\varepsilon$ is
an n x 1 vector representing experimental errors. Usually,
the experimental error is assumed to have a multivariate
normal distribution with mean equal to zero and
variance-covariance matrix equal to $\sigma^2 I$, where
$\sigma^2$ is the scalar value of the common variance of the
experimental errors. This assumption will be made throughout
this paper.

In practice, the modeling problem is to estimate the
parameters $\beta$ from data Y and X. The most common method
of doing this is called least squares estimation or sometimes
ordinary least squares (OLS). The latter designation will be
used in this paper.

Under certain fairly general and common conditions
OLS is an adequate method of estimating $\beta$. However, when
the data are "ill-conditioned" or nonorthogonal, OLS may
yield poor estimates of the true parameters.

Ridge regression (RR) has been proposed [Ref. 1] as an
alternative estimation method that might yield better
estimates under conditions where OLS does poorly.


B. ORDINARY LEAST SQUARES 

For convenience, it is assumed that the elements of X
are scaled such that X'X has the form of a correlation
matrix. This is done by forming from each element $x_{ij}$ a
new element $\tilde{x}_{ij}$ such that

$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \qquad (2)$$

where $\bar{x}_j$ is the mean value of the elements of the jth
independent variable and $s_j$ is its standard deviation
times an appropriate constant such that the diagonal
elements of X'X are equal to one. The OLS estimator of
$\beta$ is then

$$\hat{\beta} = (X'X)^{-1} X'Y \qquad (3)$$


so long as $(X'X)^{-1}$ exists.¹ The estimator $\hat{\beta}$ is
unique, unbiased and is the best linear unbiased estimator
(BLUE) of $\beta$ (it has the minimum variance among all linear
unbiased estimators of $\beta$) so long as $E(Y) = X\beta$ and
$E[(Y - X\beta)(Y - X\beta)'] = \sigma^2 I$ where $\sigma^2$ is a
scalar, as assumed previously.
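
As a minimal sketch of equations (2) and (3), the following Python fragment scales two predictors to correlation form and solves the resulting 2 x 2 normal equations directly. The data values and the helper names (`standardize`, `dot`) are illustrative assumptions and are not taken from the study.

```python
import math

def standardize(col):
    """Center a column and scale it to unit length, so X'X has a unit diagonal."""
    mean = sum(col) / len(col)
    centered = [v - mean for v in col]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data: two strongly correlated predictors and a response.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
y = [2.0, 4.1, 6.2, 7.9, 10.1]

z1, z2 = standardize(x1), standardize(x2)
r = dot(z1, z2)                   # off-diagonal element of X'X
g = [dot(z1, y), dot(z2, y)]      # X'Y
det = 1.0 - r * r                 # determinant of X'X = [[1, r], [r, 1]]

# Equation (3) for p = 2, using the explicit inverse of a 2 x 2 matrix.
beta_ols = [(g[0] - r * g[1]) / det, (g[1] - r * g[0]) / det]
print("r =", r, "beta_ols =", beta_ols)
```

Because the predictors are nearly collinear, the determinant of X'X is close to zero, which is the situation in which the OLS estimates become unstable.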
The OLS estimator $\hat{\beta}$ is commonly used and is
particularly useful when it can be assumed that Y is a
multivariate normal vector with mean vector $X\beta$ and
covariance matrix $\sigma^2 I$. In this case, it can be shown²
that the maximum likelihood estimator of $\beta$ is the same as
the OLS estimator and furthermore, since $\hat{\beta}$ is a
linear function of the elements of Y, $\hat{\beta}$ has a
multivariate normal distribution with mean vector equal to
$\beta$ and covariance matrix $\sigma^2 (X'X)^{-1}$. This latter
characteristic of $\hat{\beta}$ allows the use of hypothesis
tests and the computation of confidence bounds.

Unfortunately, in some cases X'X is "ill-conditioned"
and OLS yields poor estimates. This typically occurs when
an experiment is poorly designed or there are economic or
physical restraints causing strong correlations among the
predictor variables. In this case X'X, in its correlation
matrix form, will not be orthogonal.


¹For a derivation and details of properties of the OLS
estimator, see, for example, Ref. 2.

²For example, see Ref. 2, page 182.


Hoerl and Kennard [Ref. 3] address the eigenvalues of
X'X (denoted by $\lambda_1, \lambda_2, \ldots, \lambda_p$) and
point out that nonorthogonal data are characterized by the
smallest eigenvalue ($\lambda_{\min}$) being much less than
unity and that, since $\sigma^2/\lambda_{\min}$ is a lower bound
for the mean squared distance between $\hat{\beta}$ and $\beta$,
then for X'X nonorthogonal, the difference between
$\hat{\beta}$ and $\beta$ has a high probability of being large.
When X'X is nonorthogonal $\hat{\beta}$ is characterized by one
or more of the following difficulties, for example:
(1) large variance,
(2) large magnitude of residual errors,
(3) incorrect signs of parameter estimates.
C. RIDGE REGRESSION
A. E. Hoerl suggested [Refs. 1 and 4] that the large
variance of $\hat{\beta}$ for nonorthogonal data could be reduced
by the addition of a constant k > 0 to the diagonal elements of
X'X, thus yielding

$$\hat{\beta}^{*} = (X'X + kI)^{-1} X'Y \qquad (4)$$

as an estimator. Equation (4) is derived in Appendix A.

Note that for k equal to zero the estimator $\hat{\beta}^{*}$ is
equal to the OLS estimator $\hat{\beta}$. Therefore, OLS can be
thought of as a special case of ridge regression.³ Hoerl suggested





³See Appendix B for a discussion of an even more
general estimator.

the name "ridge regression" for this procedure because of
its mathematical similarity to some of his earlier work 
[Ref. 5] on quadratic response functions. Appendix A 
contains a derivation of the ridge regression estimator. 
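
The effect of k in equation (4) can be illustrated with a small sketch for p = 2, assuming X'X is in correlation form with off-diagonal element r. The numerical values of `r` and `g` (standing for X'Y) and the function names are hypothetical.

```python
def ridge(r, g, k):
    """Return (X'X + kI)^{-1} X'Y for p = 2 with X'X = [[1, r], [r, 1]]."""
    d = (1.0 + k) ** 2 - r ** 2          # determinant of X'X + kI
    return [((1.0 + k) * g[0] - r * g[1]) / d,
            ((1.0 + k) * g[1] - r * g[0]) / d]

def squared_length(b):
    """Squared length b'b of a coefficient vector."""
    return sum(v * v for v in b)

# Hypothetical nonorthogonal problem: correlation 0.99 between predictors.
r, g = 0.99, [0.8, 0.7]

beta_ols = ridge(r, g, 0.0)     # k = 0 reproduces the OLS estimator
beta_ridge = ridge(r, g, 0.1)   # a small k > 0 shrinks the estimate

print("OLS:  ", beta_ols, squared_length(beta_ols))
print("ridge:", beta_ridge, squared_length(beta_ridge))
```

With these assumed inputs the OLS coefficients are large and of opposite sign even though both elements of X'Y are positive, while a modest k pulls both coefficients sharply toward zero.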
1. Mean Squared Error 

The rationale behind using the ridge estimator is
to minimize the mean squared error (MSE) associated with
the estimate instead of minimizing the sum of squares of
residuals as is done in OLS.⁴ Hoerl and Kennard show
that the mean squared error is given by

$$\mathrm{MSE} = \mathrm{Variance} + (\mathrm{Bias})^2 \qquad (5)$$


Furthermore, they show that variance is a monotonically
decreasing function of k, that the squared bias is a
monotonically increasing function of k and that the rate
of change of variance, for nonorthogonal data and small k,
is considerably larger than the rate of change of the
squared bias. Figure 1 is a graphical illustration of
these relationships. Hoerl and Kennard argue that it is
possible to find some k > 0 such that the variance is
greatly reduced while only a small amount of bias is
introduced, thus yielding a smaller MSE than if OLS (k = 0)


⁴In the case of unbiased estimation, which OLS is,
these are equivalent criteria.

were used. Indeed they show that if $\beta'\beta$ is bounded, then
such a k always exists. Thus, proper use of ridge regression
on nonorthogonal data ensures a reduced MSE of
estimation.
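
The trade-off in equation (5) can be sketched numerically. The text above does not reproduce the component formulas, so the sketch assumes the usual eigenvalue expressions attributed to Hoerl and Kennard: total variance $\sigma^2 \sum_i \lambda_i/(\lambda_i + k)^2$ and squared bias $k^2 \sum_i \alpha_i^2/(\lambda_i + k)^2$, where $\lambda_i$ are the eigenvalues of X'X and $\alpha_i$ are the components of $\beta$ along the eigenvectors. All numerical values are hypothetical.

```python
def variance(lams, sigma2, k):
    """Assumed total variance of the ridge estimator as a function of k."""
    return sigma2 * sum(l / (l + k) ** 2 for l in lams)

def squared_bias(lams, alphas, k):
    """Assumed squared bias of the ridge estimator as a function of k."""
    return k ** 2 * sum(a ** 2 / (l + k) ** 2 for l, a in zip(lams, alphas))

# Hypothetical ill-conditioned case: one eigenvalue of X'X is near zero.
lams = [1.99, 0.01]
alphas = [1.0, 0.5]
sigma2 = 1.0

ks = [0.0, 0.05, 0.1, 0.2, 0.5, 1.0]
var_vals = [variance(lams, sigma2, k) for k in ks]
bias_vals = [squared_bias(lams, alphas, k) for k in ks]
mse_vals = [v + b for v, b in zip(var_vals, bias_vals)]

for k, v, b, m in zip(ks, var_vals, bias_vals, mse_vals):
    print(f"k={k:4.2f}  variance={v:9.4f}  bias^2={b:7.4f}  MSE={m:9.4f}")
```

Under these assumptions the variance term falls steeply for small k while the squared bias rises slowly, so the MSE at some k > 0 is well below its value at k = 0.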

The problem remains to select an appropriate
value of k. Hoerl and Kennard [Ref. 6] suggest the use of
two graphical devices as aids to determining an appropriate
value of k. The first is the ridge trace, a two-dimensional
plot of the elements of $\hat{\beta}^{*}$ as functions of k, and
the second is a plot of an estimate of the squared length of
the coefficient vector, $\hat{\beta}^{*\prime}\hat{\beta}^{*}$. The
ridge trace is used to gain an understanding of the underlying
correlations between the various predictor variables while the
plot of $\hat{\beta}^{*\prime}\hat{\beta}^{*}$ is used to
subjectively determine a suitable range of values of k.
A typical ridge trace is illustrated in Figure 2 and a
typical plot of $\hat{\beta}^{*\prime}\hat{\beta}^{*}$ is depicted
in Figure 3. Notice that $\hat{\beta}^{*\prime}\hat{\beta}^{*}$, in
Figure 3, decreases steeply for small k (k < 0.2) but in the
range about 0.3 to 0.4 has become much less sensitive to
further increases in k.
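
The two graphical devices can be tabulated rather than plotted. The sketch below evaluates a ridge trace and the squared length of the coefficient vector on a grid of k values for a hypothetical two-predictor problem in correlation form; `r` and `g` (standing for X'Y) are assumed values, not data from Figures 2 and 3.

```python
def ridge(r, g, k):
    """Return (X'X + kI)^{-1} X'Y for p = 2 with X'X = [[1, r], [r, 1]]."""
    d = (1.0 + k) ** 2 - r ** 2          # determinant of X'X + kI
    return [((1.0 + k) * g[0] - r * g[1]) / d,
            ((1.0 + k) * g[1] - r * g[0]) / d]

r, g = 0.99, [0.8, 0.7]                  # hypothetical inputs

ks = [i / 20 for i in range(21)]         # k = 0.00, 0.05, ..., 1.00
trace = [ridge(r, g, k) for k in ks]     # one coefficient vector per k
lengths = [sum(b * b for b in beta) for beta in trace]

for k, beta, s in zip(ks, trace, lengths):
    print(f"k={k:4.2f}  b1={beta[0]:8.4f}  b2={beta[1]:8.4f}  len2={s:9.4f}")
```

In this assumed example the squared length drops quickly over the first few grid points and then flattens, and the second coefficient changes sign as k grows, which is the kind of behavior the ridge trace is meant to expose.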

2. Alternative Methods of Choosing k 

The previously described method of subjectively
choosing a suitable value of k is the current method in
use and appears to be useful. A major problem arises,
however, because the method denies to the analyst knowledge
of the probability distribution of $\hat{\beta}^{*}$ and,
therefore, any probabilistic inferences concerning the
resulting estimator. Hoerl and Kennard have suggested a
general form of ridge regression [Ref. 3] and an iterative
method of determining k. In addition, Hemmerle [Ref. 7] has
derived a closed form solution based on this method. Another
possibility is to use the ridge trace or the plot of
$\hat{\beta}^{*\prime}\hat{\beta}^{*}$ quantitatively to calculate
a point value for k in such a way that the marginal
probability distribution, $f_{\hat{\beta}^{*}}$, may be
determined. Two such methods using the ridge trace are
examined in the next section.
II. PROPOSED OBJECTIVE RULES FOR CHOOSING k


The slope (rate of change) of the ridge trace curves 
or the absolute change of the ridge trace curves over a 
specified interval may be used to determine a value of the 
ridge parameter, k, objectively. These criteria are 
discussed here. 

Either of these criteria may be sensitive to the
behavior of each coefficient $\hat{\beta}^{*}_i$. In general,
$\hat{\beta}^{*}_i$ is not monotonic in k, although all of the
coefficients approach zero as k is increased without bound.
It has been noted by Marquardt and Snee [Ref. 8] that it is
not uncommon for one or more $\hat{\beta}^{*}_i$ to increase in
absolute value as k is increased. (See, for example,
Figure 2.) Therefore, the ridge trace should be examined by
the analyst to detect any behavior of $\hat{\beta}^{*}$ that
might adversely affect the proper selection of k even though
the ridge trace is not to be used directly to select a
specific value of k.

It is clear that $\hat{\beta}^{*}$ is distributed multivariate
normal if Y is distributed multivariate normal and a specific
value of k is selected a priori. However, whenever the
value of k is dependent on a data sample its value will
not generally be the same for each data sample. Therefore,
k is a random variable. Let K denote this (scalar) random
variable.


The marginal probability distribution of $\hat{\beta}^{*}$ may be
derived from the joint probability distribution of K and
$\hat{\beta}^{*}$, which can be determined by

$$f_{\hat{\beta}^{*},K} = f_{\hat{\beta}^{*}|K}\, f_K \qquad (6)$$

if the conditional distribution of $\hat{\beta}^{*}$ given K,
$f_{\hat{\beta}^{*}|K}$, and the marginal distribution of K,
$f_K$, are known. As stated above, when K is given, the
distribution of $\hat{\beta}^{*}$ is known. It remains to
determine the marginal of K, $f_K$. Clearly,
this distribution depends on how K is related to Y. The
procedure will be to find a mapping from the range of Y
into the range of K which gives the marginal distribution of
K. With this distribution and the known conditional
distribution of $\hat{\beta}^{*}$ given K, the joint distribution
of $\hat{\beta}^{*}$ and K may be determined. It is convenient to
consider the cumulative distribution function, $F_K(k)$, since,
if the functional relationship of K to Y, K = h(Y), is known
then

$$F_K(k) = P[K \le k] = P[h(Y) \le k] = P[Y \in R_k] \qquad (7)$$

where $R_k$ is a region in the space of Y corresponding to
$K \le k$. Thus if $R_k$ can be determined then, since the
marginal distribution of Y is known, $F_K(k) = P[Y \in R_k]$ can
be determined and $f_K$ may be determined from $F_K$ by
differentiation. It remains to determine $R_k$ corresponding
to a specified region in the space of K and an objective
rule for mapping from Y to K.


A. ABSOLUTE VALUE CRITERION 

The practical range of the ridge parameter is taken to
be $0 \le k \le 1$ in the literature. It seems reasonable then
to choose the smallest value of k such that all
$\hat{\beta}^{*}_i(k)$ are close to their respective values at
k = 1. In other words,

$$|\hat{\beta}^{*}_i(k) - \hat{\beta}^{*}_i(1)| \le \delta_i, \quad i = 1, 2, \ldots, p \qquad (8)$$

where $\delta_i$ is a constant selected by the analyst. The
criterion expressed by (8) means that the ridge trace curves,
$\hat{\beta}^{*}_i$, at k are within $\delta_i$ of their value at
k = 1, beyond which there is no interest. Here $\delta_i$ refers
to the ith scalar component of a p x 1 vector, $\delta$. Suppose
that at some $k = k_0$ the mth component of the left hand side
of (8) is the one whose absolute magnitude is largest. Define a
p x 1 vector $\tau$ such that $\tau_m = \pm\delta_m$, as
appropriate, and the other components of $\tau$ are equal to the
corresponding values of
$[\hat{\beta}^{*}_i(k_0) - \hat{\beta}^{*}_i(1)]$. Then equation (8)
can be rewritten in vector form

$$\hat{\beta}^{*}(k) - \hat{\beta}^{*}(1) = \tau \qquad (9)$$
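
One way to apply criterion (8) in practice is a simple grid search over k. The sketch below is only an illustration for p = 2 with X'X in correlation form; the grid spacing, the tolerances `delta`, and the inputs `r` and `g` (standing for X'Y) are all assumptions.

```python
def ridge(r, g, k):
    """Return (X'X + kI)^{-1} X'Y for p = 2 with X'X = [[1, r], [r, 1]]."""
    d = (1.0 + k) ** 2 - r ** 2          # determinant of X'X + kI
    return [((1.0 + k) * g[0] - r * g[1]) / d,
            ((1.0 + k) * g[1] - r * g[0]) / d]

def choose_k_absolute(r, g, delta, steps=100):
    """Smallest grid value of k in [0, 1] that satisfies criterion (8)."""
    target = ridge(r, g, 1.0)            # coefficient values at k = 1
    for i in range(steps + 1):
        k = i / steps
        beta = ridge(r, g, k)
        if all(abs(b - t) <= d for b, t, d in zip(beta, target, delta)):
            return k
    return 1.0  # not reached: k = 1 satisfies (8) trivially

r, g = 0.99, [0.8, 0.7]                  # hypothetical inputs
delta = [0.2, 0.2]                       # tolerances chosen by the analyst
k_hat = choose_k_absolute(r, g, delta)
print("selected k:", k_hat)
```

Because the selected value depends on X'Y, and therefore on Y, repeating the search on a different sample would in general return a different k, which is why K is treated as a random variable.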







B. DERIVATIVE CRITERION 
Another potential criterion to use for selecting k is
to require that the slopes of all $\hat{\beta}^{*}_i$ be "flat
enough" in the sense that

$$\left|\frac{d\hat{\beta}^{*}_i}{dk}\right| \le \delta_i, \quad i = 1, 2, \ldots, p \qquad (10)$$

where $\delta_i$ is as previously defined. Define m such that the
mth component of the left hand side of (10) is the one
whose absolute magnitude is largest and define a p x 1
vector $\pi$ such that $\pi_m = \pm\delta_m$, as appropriate, and
the other components of $\pi$ are equal to the corresponding
values of $d\hat{\beta}^{*}_i/dk$. Then equation (10) can be
written, in vector form

$$\frac{d\hat{\beta}^{*}(k)}{dk} = \pi \qquad (11)$$
III. THE PROBLEM


The problem is to determine the probability 
distribution of K given Y. It is proposed to determine 
this by attempting to derive and examine the functional 


relationship of Y and K. 


A. ABSOLUTE VALUE CRITERION 
The criterion expressed by equation (9) may be stated,
by substituting from equation (4)

$$(X'X + kI)^{-1}X'Y - (X'X + I)^{-1}X'Y = \tau \qquad (12)$$

and by factoring

$$[(X'X + kI)^{-1} - (X'X + I)^{-1}]X'Y = \tau \qquad (13)$$

but, as shown in Appendix C, equation (C-4), the expression
in brackets may be expanded to

$$(X'X + kI)^{-1}[(X'X + I) - (X'X + kI)](X'X + I)^{-1} \qquad (14)$$

Therefore, by canceling terms and simplifying, equation (13)
becomes

$$(1 - k)(X'X + kI)^{-1}(X'X + I)^{-1}X'Y = \tau \qquad (15)$$

If $k \ne 1$ and if $(X'X + kI)^{-1}$ and $(X'X + I)^{-1}$ exist, then

$$X'Y = \frac{1}{1 - k}(X'X + kI)(X'X + I)\tau \qquad (16)$$

The task then is to solve the linear equations in (16)
for Y in order to determine $R_k$. Unfortunately, equation (16)
represents p linear restraints (hyperplanes) on n unknown
variables where, in general, n > p. Furthermore, $\tau$ is a
function of Y. Thus, $R_k$ is not easily determined under
this criterion.
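
The step from (12) to (15) can be checked numerically. The following sketch, for p = 2 with X'X in correlation form, compares the difference of the two ridge estimates with the factored expression; the helper `solve` and the values of `r`, `g` (standing for X'Y), and `k` are hypothetical.

```python
def solve(r, v, k):
    """Apply (X'X + kI)^{-1} to v, for p = 2 with X'X = [[1, r], [r, 1]]."""
    d = (1.0 + k) ** 2 - r ** 2          # determinant of X'X + kI
    return [((1.0 + k) * v[0] - r * v[1]) / d,
            ((1.0 + k) * v[1] - r * v[0]) / d]

r, g, k = 0.99, [0.8, 0.7], 0.3          # hypothetical inputs

# Left-hand side of (12): difference of the ridge estimates at k and at 1.
lhs = [a - b for a, b in zip(solve(r, g, k), solve(r, g, 1.0))]

# Left-hand side of (15): (1 - k)(X'X + kI)^{-1}(X'X + I)^{-1} X'Y.
rhs = [(1.0 - k) * v for v in solve(r, solve(r, g, 1.0), k)]

print(lhs)
print(rhs)
```

The two vectors agree to rounding error, which supports the algebra but does not remove the difficulty noted above: the right-hand side $\tau$ still depends on Y.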


B. DERIVATIVE CRITERION 
The criterion given by equation (11) may be stated by
substituting from equation (4)

$$\frac{d}{dk}\left[(X'X + kI)^{-1}X'Y\right] = \pi \qquad (17)$$

or, since $\frac{d}{dk}(X'X + kI) = I$,

$$-(X'X + kI)^{-2}X'Y = \pi \qquad (18)$$

Now, if (X'X + kI) is not singular then

$$X'Y = (X'X + kI)^{2}\pi \qquad (19)$$

where the negative sign has been dropped since the
criterion actually specifies the absolute value of the
components of the derivative and the notation of $\pi$ accounts
for proper signs.

Equation (19) is similar to equation (16), as it should
be since the criteria are similar, and the same difficulties
are encountered in determining $R_k$ as for the previous
criterion. In addition, the derivative of $\pi$ will be
difficult to determine. Therefore, the derivative
criterion does not lead to a useful result either.
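
The closed form in (18) can be compared against a finite-difference slope of the ridge trace. This is a minimal sketch for p = 2 with X'X in correlation form; the helper `solve`, the step size `h`, and the inputs `r`, `g` (standing for X'Y), and `k` are assumptions.

```python
def solve(r, v, k):
    """Apply (X'X + kI)^{-1} to v, for p = 2 with X'X = [[1, r], [r, 1]]."""
    d = (1.0 + k) ** 2 - r ** 2          # determinant of X'X + kI
    return [((1.0 + k) * v[0] - r * v[1]) / d,
            ((1.0 + k) * v[1] - r * v[0]) / d]

r, g, k = 0.99, [0.8, 0.7], 0.3          # hypothetical inputs

# Closed form from (18): derivative of the ridge estimate is -(X'X + kI)^{-2} X'Y.
closed = [-v for v in solve(r, solve(r, g, k), k)]

# Central finite difference of the ridge trace as an independent check.
h = 1e-6
numeric = [(a - b) / (2 * h)
           for a, b in zip(solve(r, g, k + h), solve(r, g, k - h))]

print(closed)
print(numeric)
```

Agreement between the two confirms the sign that is later dropped in passing from (18) to (19).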




IV. NOTES ON THE FULL BAYESIAN RIDGE ESTIMATOR 


The full Bayesian ridge estimator (FBRE) is suggested
by Eskew [Ref. 9] and is given as

$$\hat{\beta}^{*} = (X'X + kI)^{-1}(X'Y + k\beta_0) \qquad (20)$$

where $\beta_0$ is a prior estimate of $\beta$. There are two
interesting properties of $\hat{\beta}^{*}$ not noted by Eskew.

First suppose that the prior $\beta_0$ is chosen to be the
OLS estimate $\hat{\beta}$. Then

$$\hat{\beta}^{*} = (X'X + kI)^{-1}(X'Y + k\hat{\beta}) \qquad (21)$$

and hence

$$\hat{\beta}^{*} = (X'X + kI)^{-1}[I + k(X'X)^{-1}]X'Y \qquad (22)$$

But

$$[I + k(X'X)^{-1}] = (X'X + kI)(X'X)^{-1} \qquad (23)$$

Substituting (23) into (22)

$$\hat{\beta}^{*} = (X'X)^{-1}X'Y = \hat{\beta} \qquad (24)$$
Thus if the OLS estimator is used as a prior estimate
for the FBRE, equation (21), then the resulting estimate is
equal to the OLS estimate.

Now, suppose that any prior estimate $\beta_0$ is used in
equation (20) but the resulting estimate is then used as a
prior in (20) to compute another estimate. If this procedure
is repeated indefinitely, in the limit the result
will again be the OLS estimator regardless of what prior,
$\beta_0$, was initially used. The proof of this is shown in
Appendix B.
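
Both properties can be observed numerically. The sketch below uses p = 2 with X'X in correlation form; the helper names, the starting prior, the iteration count, and the values of `r`, `g` (standing for X'Y), and `k` are hypothetical.

```python
def solve(r, v, k):
    """Apply (X'X + kI)^{-1} to v, for p = 2 with X'X = [[1, r], [r, 1]]."""
    d = (1.0 + k) ** 2 - r ** 2          # determinant of X'X + kI
    return [((1.0 + k) * v[0] - r * v[1]) / d,
            ((1.0 + k) * v[1] - r * v[0]) / d]

def fbre(r, g, k, prior):
    """Equation (20): (X'X + kI)^{-1} (X'Y + k * prior)."""
    return solve(r, [gi + k * p0 for gi, p0 in zip(g, prior)], k)

r, g, k = 0.9, [0.8, 0.7], 0.1           # hypothetical inputs
beta_ols = solve(r, g, 0.0)              # OLS estimate (k = 0)

# Property 1: with the OLS estimate as the prior, the FBRE returns OLS.
fixed = fbre(r, g, k, beta_ols)

# Property 2: feeding each result back in as the next prior converges to OLS.
beta = [10.0, -10.0]                     # arbitrary starting prior
for _ in range(100):
    beta = fbre(r, g, k, beta)

print("OLS:     ", beta_ols)
print("fixed:   ", fixed)
print("iterated:", beta)
```

The convergence rate in this example is governed by the largest value of $k/(\rho_i + k)$ over the eigenvalues $\rho_i$ of X'X, so a smaller k or a better-conditioned X'X converges in fewer iterations.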
V. CONCLUSIONS AND RECOMMENDATIONS


A. CONCLUSIONS 

The determination of a probability distribution of the
ridge estimator, $\hat{\beta}^{*}$, is desirable in order to
facilitate the use of hypothesis tests and the computation of
confidence bounds concerning $\beta$. The probability
distribution of $\hat{\beta}^{*}$ depends on the objective rule
used to select the ridge parameter, k. Neither of the two
objective rules examined here appears to lead to a simply
determined probability distribution.


B. RECOMMENDATIONS 

The search for a useful probability distribution of
$\hat{\beta}^{*}$ should be pursued further. In particular, the
closed form solution for k presented by Hemmerle [Ref. 7] may
prove fruitful. Other possibilities include investigating
other criteria based on the ridge trace such as minimizing
the sum of squares, over all i = 1, 2, ..., p, of the
difference between $\hat{\beta}^{*}_i(k)$ and
$\hat{\beta}^{*}_i(1)$. Also, the same criteria applied to the
ridge trace could be considered for the squared length of
$\hat{\beta}^{*}$.
APPENDIX A


DERIVATION OF THE RIDGE REGRESSION ESTIMATOR 


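The residual sum of squares for any estimator can be
written

$$\phi(\hat{\beta}) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) = e'e \qquad (A-1)$$

where $e = Y - X\hat{\beta}$ is the vector of residuals.

In ridge regression it is desirable to minimize the
residual sum of squares subject to an acceptable length,
c, of the regression vector $\hat{\beta}^{*}$. Expressed as a
Lagrangian restraint problem this is

$$\min\ \phi'(\hat{\beta}^{*}) = (Y - X\hat{\beta}^{*})'(Y - X\hat{\beta}^{*}) + k(\hat{\beta}^{*\prime}\hat{\beta}^{*} - c^2) \qquad (A-2)$$

where k is the inverse of the Lagrangian multiplier.
Taking partial derivatives of $\phi'$ with respect to
$\hat{\beta}^{*}$ and setting them equal to zero

$$\frac{\partial \phi'}{\partial \hat{\beta}^{*}} = 0 = \frac{\partial}{\partial \hat{\beta}^{*}}\left[Y'Y - Y'X\hat{\beta}^{*} - \hat{\beta}^{*\prime}X'Y + \hat{\beta}^{*\prime}X'X\hat{\beta}^{*} + k\hat{\beta}^{*\prime}\hat{\beta}^{*} - kc^2\right] \qquad (A-3)$$

Hence

$$0 = -2X'Y + 2X'X\hat{\beta}^{*} + 2k\hat{\beta}^{*} \qquad (A-4)$$

or

$$2X'Y = 2X'X\hat{\beta}^{*} + 2k\hat{\beta}^{*} \qquad (A-5)$$

Therefore,

$$X'Y = (X'X + kI)\hat{\beta}^{*} \qquad (A-6)$$

Now, if (X'X + kI) is non-singular (which k is selected
to ensure), then

$$\hat{\beta}^{*} = (X'X + kI)^{-1}X'Y \qquad (A-7)$$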
APPENDIX B


FULL BAYESIAN RIDGE ESTIMATION 


A. BACKGROUND 

Eskew [Ref. 9] points out that ridge estimation is 
equivalent to minimizing the squared differences between 
the regression estimates and a prior estimate of zero 
subject to a constraint on the sum of squares and suggests 
that a non-zero prior might be more reasonable. Following 
this line of reasoning he derives the full Bayesian ridge 


estimator (FBRE) 


$$\hat{\beta}^{*} = (X'X + kI)^{-1}(X'Y + k\beta_0) \qquad (B-1)$$

where $\beta_0$ is a prior estimate of the true parameters $\beta$.
Note that the ridge estimator is a special case of FBRE
where the prior is taken to be zero.

Eskew shows that the variance of the FBRE is the same
as the variance of the ridge regression estimator (RRE)
while the squared bias of the FBRE is less than that for
the RRE, thereby resulting in a reduction of mean squared
error.


B. ITERATIVE USE OF THE FULL BAYESIAN RIDGE ESTIMATOR 
Suppose that the FBRE is calculated using any prior,
$\beta_0$, and then the result, $\hat{\beta}^{*}_1$, is used as a
prior to calculate another FBRE, $\hat{\beta}^{*}_2$. If this
procedure is repeated m times the result may be written

$$\hat{\beta}^{*}_m = \frac{1}{k}\sum_{i=1}^{m}(kA)^i X'Y + (kA)^m \beta_0 \qquad (B-2)$$

where $A = (X'X + kI)^{-1}$. It is interesting to determine the
form of $\hat{\beta}^{*}_m$ in the limit as m approaches infinity.
Since A and X'X are positive definite matrices their eigenvalues
are positive. Let $\lambda_i > 0$ be an eigenvalue of A and
$\rho_i > 0$ be an eigenvalue of X'X. Hoerl and Kennard show the
relationship between $\lambda_i$ and $\rho_i$ to be

$$\lambda_i = 1/(\rho_i + k) \qquad (B-3)$$

Now there exists an orthogonal p x p matrix P with P'P = I
such that

$$P'AP = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) \qquad (B-4)$$

Or since the eigenvalues of kA are $k\lambda_i$ and the
eigenvalues of $A^m$ are $(\lambda_i)^m$

$$P'(kA)^m P = \mathrm{diag}(k^m\lambda_1^m, k^m\lambda_2^m, \ldots, k^m\lambda_p^m) \qquad (B-5)$$

Now

$$P'\left[\lim_{m\to\infty}(kA)^m\right]P = \lim_{m\to\infty} P'(kA)^m P \qquad (B-6)$$
The right hand side of (B-6) is the limit of the right hand
side of (B-5). By substituting from equation (B-3) a
typical diagonal element is $0 < [k/(\rho_i + k)]^m < 1$, since
$\rho_i > 0$ for all i = 1, 2, ..., p. Therefore, each of the
elements of the right hand side of (B-5) approaches zero
as m approaches infinity. Hence

$$P'\left[\lim_{m\to\infty}(kA)^m\right]P = 0 \qquad (B-7)$$

This can only occur if

$$\lim_{m\to\infty}(kA)^m = 0 \qquad (B-8)$$

Therefore, the last term of equation (B-2) is zero in the
limit. Now define a matrix function S = S(kA) where

$$S = \sum_{i=1}^{\infty}(kA)^i \qquad (B-9)$$

DeRusso, Roy, and Close [Ref. 10] show that S(kA) converges
if and only if $S(k\lambda_i)$ converges for all $k\lambda_i$, the
eigenvalues of kA. Clearly this will occur if and only if

$$|k\lambda_{\max}| < 1 \qquad (B-10)$$

Substituting equation (B-3)

$$\left|\frac{k}{\rho_{\min} + k}\right| < 1 \qquad (B-11)$$

or, after some algebra

$$\rho_{\min} > -2k \quad \text{and} \quad \rho_{\min} > 0 \qquad (B-12)$$

Since $\rho_{\min}$ is an eigenvalue of a positive definite matrix,
X'X, then $\rho_{\min} > 0$ and both conditions of (B-12) are met.


Therefore S(kA) does converge. To see what it converges to,
define S' = S + I and multiply S' on the left by (I - kA)

$$(I - kA)S' = (I - kA)(I + kA + (kA)^2 + \cdots) \qquad (B-13)$$

and multiplying the right hand side out

$$(I - kA)S' = [I + kA + (kA)^2 + \cdots] - [kA + (kA)^2 + \cdots] = I \qquad (B-14)$$

Then

$$S' = (I - kA)^{-1} \qquad (B-15)$$

Then

$$S = \{[(kA)^{-1} - I]kA\}^{-1} - I = (1/k)A^{-1}[(1/k)A^{-1} - I]^{-1} - I \qquad (B-16)$$

Substituting $A = (X'X + kI)^{-1}$

$$S = [(1/k)X'X + I][(1/k)X'X]^{-1} - I = k(X'X)^{-1} \qquad (B-17)$$

Substituting S into equation (B-2)

$$\lim_{m\to\infty}\hat{\beta}^{*}_m = (1/k)\,k\,(X'X)^{-1}X'Y \qquad (B-18)$$

Therefore

$$\lim_{m\to\infty}\hat{\beta}^{*}_m = (X'X)^{-1}X'Y = \hat{\beta} \qquad (B-19)$$

Thus the iterative procedure, starting with any prior $\beta_0$,
converges to the OLS estimator, $\hat{\beta}$.
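
Equation (B-17) can be checked one eigenvalue at a time: for an eigenvalue $\rho$ of X'X, the series in (B-9) reduces to the scalar geometric series $\sum_i [k/(\rho + k)]^i$, whose limit should be $k/\rho$. The sketch below compares partial sums with that limit; the eigenvalues, k, and the number of terms are hypothetical choices.

```python
def partial_sum(k, rho, m):
    """Sum of (k * lam)**i for i = 1..m, with lam = 1 / (rho + k) from (B-3)."""
    x = k / (rho + k)
    return sum(x ** i for i in range(1, m + 1))

k = 0.1
for rho in (0.1, 1.0, 1.9):              # hypothetical eigenvalues of X'X
    # Second column: partial sum of the series; third column: k / rho.
    print(rho, partial_sum(k, rho, 200), k / rho)
```

The smallest eigenvalue gives the slowest convergence, since its ratio $k/(\rho + k)$ is closest to one.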


APPENDIX C

MISCELLANEOUS MATRIX ALGEBRA AND CALCULUS

Let A, B, and C denote n x n nonsingular matrices. Denote
their inverses by $A^{-1}$, $B^{-1}$, and $C^{-1}$, respectively.

A. MATRIX ALGEBRA

First, note that

$$C(A + B)^{-1} = (AC^{-1} + BC^{-1})^{-1} \qquad (C-1)$$

since

$$C(A + B)^{-1} = [(A + B)C^{-1}]^{-1} \qquad (C-2)$$

$$= (AC^{-1} + BC^{-1})^{-1} \qquad (C-3)$$

Also

$$A^{-1} \pm B^{-1} = A^{-1}(B \pm A)B^{-1} = B^{-1}(B \pm A)A^{-1} \qquad (C-4)$$

since

$$A^{-1}(B \pm A)B^{-1} = A^{-1}BB^{-1} \pm A^{-1}AB^{-1} = A^{-1} \pm B^{-1} \qquad (C-5)$$

and

$$B^{-1}(B \pm A)A^{-1} = B^{-1}BA^{-1} \pm B^{-1}AA^{-1} = A^{-1} \pm B^{-1} \qquad (C-6)$$


B. MATRIX CALCULUS 

Let A(t), B(t), and C(t) denote matrices of conformable
dimensions (square and nonsingular where an inverse is taken)
whose elements may be functions of the scalar variable t. Let
$\dot{A}(t)$ and $\dot{B}(t)$ denote the derivatives of A(t) and
B(t), respectively, with respect to t.

The following are shown to be true by DeRusso, Roy,
and Close [Ref. 10].

$$\frac{d}{dt}[A(t)B(t)] = \dot{A}(t)B(t) + A(t)\dot{B}(t) \qquad (C-7)$$

and

$$\frac{d}{dt}A^{-1}(t) = -A^{-1}(t)\dot{A}(t)A^{-1}(t) \qquad (C-8)$$

